Text similarity in academic conference papers

Authors

  • Jun-Peng Bao
  • James A. Malcolm
Abstract

If we are to use electronic plagiarism detectors on student work, it would be useful to know how much similarity should be expected in independently written documents on a similar topic. If our measure is coarse, the answer should be zero, but a finer-grained analysis (such as would be needed to detect inadequate paraphrasing) is likely to detect some background noise. How much background noise should there be? We would like to determine this, but it is hard to publish research based on analysis of student work, because we cannot know whether any particular pair of students worked completely independently, and in any case the results might attract unwelcome publicity.

To estimate an appropriate level of this background noise, we analysed submissions to an international conference using the Ferret plagiarism detector developed by Lyon et al. (2001). Ferret provides very fast and fine-grained similarity detection in moderately large collections of documents. This was an exercise in intra-corpus (collusion) detection rather than comparison with Web sources, a purpose for which the Ferret algorithm is well suited. There were 483 files; scanning the files took about 50 seconds, and calculating the similarity statistics took about 10 seconds. There were 116403 file pairs. Of these pairs, only 116 (0.1%) had more than 99 common triples, and of these only 19 pairs (0.016% of the total) had over 200 matching triples (200 is about 10% of the typical size of the smaller files).

There should be no plagiarism here, as these are published conference papers, but in fact the top few are all pairs of papers with common authors who have re-used text. A simple MS Word file compare on one of the top-ranking pairs is sufficient to make the similarities obvious (though Word by no means highlights all of them).
Nevertheless, as expected, the vast majority of document pairs showed very low similarity. As noted, there was a surprisingly large degree of similarity in just a few cases, and we accordingly investigated these pairs more carefully. The worst case was that of an author who had submitted two papers. Each paper reported the results of a single experiment, but the background material for both experiments was very much the same and he had simply reproduced the same text in both papers. We also present the other cases where similarity was high, and consider the implications for routine scanning of student work.

Corresponding author: James A. Malcolm, Division of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire, AL10 9AB. Email: [email protected]
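The pair count and the "common triples" measure quoted in the abstract can be illustrated with a short sketch. This is not the Ferret implementation: it assumes that a triple is a sequence of three consecutive lowercased words and that similarity is the number of distinct triples two documents share. The function names `word_triples` and `common_triple_counts` and the sample documents are hypothetical.

```python
import re
from itertools import combinations
from math import comb


def word_triples(text: str) -> set:
    """Return the distinct runs of three consecutive words in `text`.

    Assumption: words are lowercased runs of letters, digits and apostrophes.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + 3]) for i in range(len(words) - 2)}


def common_triple_counts(docs: dict) -> dict:
    """Map each unordered pair of document names to its shared-triple count."""
    triples = {name: word_triples(text) for name, text in docs.items()}
    return {
        (a, b): len(triples[a] & triples[b])
        for a, b in combinations(sorted(triples), 2)
    }


# Arithmetic from the abstract: 483 files give C(483, 2) unordered pairs.
N_FILES = 483
N_PAIRS = comb(N_FILES, 2)

# Shares of all pairs above the two thresholds reported in the abstract.
SHARE_OVER_99 = 100 * 116 / N_PAIRS   # percent of pairs with more than 99 common triples
SHARE_OVER_200 = 100 * 19 / N_PAIRS   # percent of pairs with over 200 matching triples

if __name__ == "__main__":
    sample = {
        "a.txt": "the cat sat on the mat",
        "b.txt": "a cat sat on the rug",
    }
    print(N_PAIRS, round(SHARE_OVER_99, 1), round(SHARE_OVER_200, 3))
    print(common_triple_counts(sample))
```

Comparing sets of triples means each shared triple is counted once per pair, however often it is repeated within either document; a real detector may weight or normalise the count differently.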

Similar resources

A Linguistic Analysis of Conference Titles in Applied Linguistics

Over the past twenty-five years, researchers have expressed considerable interest in titles of academic publications. Unfortunately, conference paper titles (CPTs) have only recently begun to receive attention. The aim of this study, therefore, is to investigate the text length, syntactic structure, and lexicon of CPTs in Applied Linguistics. A data set of 698 titles was selected from the 2008 ...

Co-Bidding Graphs for Constrained Paper Clustering

The information for many important problems can be found in various formats and modalities. Besides standard tabular form, these also include text and graphs. To solve such problems, fusion of different data sources is required. We demonstrate a methodology which is capable of enriching textual information with graph-based data and utilizing both in an innovative machine learning application of clust...

5th Symposium on Languages, Applications and Technologies

The information for many important problems can be found in various formats and modalities. Besides standard tabular form, these also include text and graphs. To solve such problems, fusion of different data sources is required. We demonstrate a methodology which is capable of enriching textual information with graph-based data and utilizing both in an innovative machine learning application of clust...

Personalized Academic Research Paper Recommendation System

A huge number of academic papers are coming out from a lot of conferences and journals these days. In these circumstances, most researchers rely on key-based search or browsing through proceedings of top conferences and journals to find their related work. To ease this difficulty, we propose a Personalized Academic Research Paper Recommendation System, which recommends related articles, for eac...

Identifying and Classifying Students' Academic Misconducts (Systematic Review)

Background: Planning for the decrease of scientific misconducts among students requires the recognition of subjects and relevant cases. The aim of the current study is determining the categories of scientific immorality among university students and categorizing them. Method: The current study is in the category of descriptive studies and data gathering is done using the systematic review meth...

Publication date: 2006